In [1]:
#importing the necessary Python libraries and the dataset:

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

data = pd.read_csv("diamonds.csv")
data.head()
Out[1]:
   Unnamed: 0  carat      cut  color  clarity  depth  table  price     x     y     z
0           1   0.23    Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43
1           2   0.21  Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31
2           3   0.23     Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31
3           4   0.29  Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63
4           5   0.31     Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75

This dataset contains an unnamed index column ("Unnamed: 0") that duplicates the row numbers. I will delete this column before moving further:

In [2]:
data = data.drop("Unnamed: 0",axis=1)
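
As an alternative to dropping the column afterwards, it can be read as the row index at load time. The sketch below assumes the CSV header starts with an empty field (which is what pandas labels "Unnamed: 0"); the inline text is a two-row stand-in copied from the output above, not the real file.

```python
import io
import pandas as pd

# Two rows copied from the head() output above. The empty first header
# field is an assumption about how diamonds.csv was saved.
csv_text = (
    ",carat,cut,color,clarity,depth,table,price,x,y,z\n"
    "1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43\n"
    "2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31\n"
)

# index_col=0 tells read_csv to use the first column as the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```

With this approach the index keeps the file's own numbering (starting at 1 here) instead of the default 0-based index.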

Now let’s start analyzing diamond prices. I will first analyze the relationship between the carat and the price of the diamond to see how the number of carats affects the price of a diamond:

In [3]:
figure = px.scatter(data_frame = data, x="carat",
                    y="price", size="depth", 
                    color= "cut", trendline="ols")
figure.show()

The trendlines show a positive, roughly linear relationship between the number of carats and the price of a diamond: diamonds with more carats tend to cost more.
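
The trendline="ols" argument asks Plotly Express to draw an ordinary least squares fit (it relies on the statsmodels package being installed). As a rough sketch of what such a fit computes, the snippet below fits a straight line with NumPy to the five rows shown by data.head(). Five rows are far too few to describe the dataset, so the coefficients are illustrative only.

```python
import numpy as np

# carat and price for the five rows shown by data.head() above
carat = np.array([0.23, 0.21, 0.23, 0.29, 0.31])
price = np.array([326, 326, 327, 334, 335], dtype=float)

# A degree-1 polyfit is an ordinary least squares straight-line fit:
# price is approximated by slope * carat + intercept
slope, intercept = np.polyfit(carat, price, deg=1)
print(f"price = {slope:.1f} * carat + {intercept:.1f}")
```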

Now I will add a new column to this dataset by calculating the size (length x width x depth) of the diamond:

In [4]:
data["size"] = data["x"] * data["y"] * data["z"]
data
Out[4]:
       carat        cut  color  clarity  depth  table  price     x     y     z        size
    0   0.23      Ideal      E      SI2   61.5   55.0    326  3.95  3.98  2.43   38.202030
    1   0.21    Premium      E      SI1   59.8   61.0    326  3.89  3.84  2.31   34.505856
    2   0.23       Good      E      VS1   56.9   65.0    327  4.05  4.07  2.31   38.076885
    3   0.29    Premium      I      VS2   62.4   58.0    334  4.20  4.23  2.63   46.724580
    4   0.31       Good      J      SI2   63.3   58.0    335  4.34  4.35  2.75   51.917250
  ...    ...        ...    ...      ...    ...    ...    ...   ...   ...   ...         ...
53935   0.72      Ideal      D      SI1   60.8   57.0   2757  5.75  5.76  3.50  115.920000
53936   0.72       Good      D      SI1   63.1   55.0   2757  5.69  5.75  3.61  118.110175
53937   0.70  Very Good      D      SI1   62.8   60.0   2757  5.66  5.68  3.56  114.449728
53938   0.86    Premium      H      SI2   61.0   58.0   2757  6.15  6.12  3.74  140.766120
53939   0.75      Ideal      D      SI2   62.2   55.0   2757  5.83  5.87  3.64  124.568444

53940 rows × 11 columns
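
Because size is a product, a single zero or missing dimension makes the whole value zero or NaN. The output above does not show whether this file contains such rows, so the check below is only a sketch on a small made-up frame; flag_invalid_sizes is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def flag_invalid_sizes(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask of rows whose x * y * z is zero, negative or missing."""
    size = df["x"] * df["y"] * df["z"]
    return size.isna() | (size <= 0)

# Made-up rows: the first comes from the output above, the other two are
# deliberately broken to show what the mask catches.
demo = pd.DataFrame({
    "x": [3.95, 0.00, 4.05],
    "y": [3.98, 3.84, np.nan],
    "z": [2.43, 2.31, 2.31],
})
mask = flag_invalid_sizes(demo)
print(mask.tolist())
```

On the real data, data[flag_invalid_sizes(data)] would list any rows worth inspecting before plotting or training.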

Now let’s have a look at the relationship between the size of a diamond and its price:

In [5]:
figure = px.scatter(data_frame = data, x="size",
                    y="price", size="size", 
                    color= "cut", trendline="ols")
figure.show()

The figure above suggests two things about diamonds:

- Premium cut diamonds are relatively larger than other diamonds.
- There is a linear relationship between the size of a diamond and its price, across all cut types.

Now let’s have a look at the prices of all the types of diamonds based on their colour:

In [6]:
fig = px.box(data, x="cut", 
             y="price", 
             color="color")
fig.show()

Now let’s have a look at the prices of all the types of diamonds based on their clarity:

In [7]:
fig = px.box(data, 
             x="cut", 
             y="price", 
             color="clarity")
fig.show()
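
A box plot draws the median and quartiles of each group, and the same numbers can be tabulated with groupby. This is a minimal sketch on the five rows shown by data.head(); Plotly and pandas may use slightly different quartile conventions, so the box edges need not match exactly.

```python
import pandas as pd

# cut, clarity and price for the five rows shown by data.head()
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "price": [326, 326, 327, 334, 335],
})

# Median price per cut: the centre line of each box
median_by_cut = sample.groupby("cut")["price"].median()
print(median_by_cut)

# Lower and upper quartiles per cut: roughly the edges of each box
quartiles = sample.groupby("cut")["price"].quantile([0.25, 0.75])
print(quartiles)
```

Grouping by ["cut", "clarity"] instead of "cut" would give one row per box in the figure above.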

Now let’s have a look at the correlation between diamond prices and other features in the dataset:

In [8]:
correlation = data.corr(numeric_only=True)
correlation["price"].sort_values(ascending=False)
Out[8]:
price    1.000000
carat    0.921591
size     0.902385
x        0.884435
y        0.865421
z        0.861249
table    0.127134
depth   -0.010647
Name: price, dtype: float64
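
By default, DataFrame.corr computes the Pearson correlation coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand for carat and price on the five rows shown by data.head() and compares the result with NumPy. The value from five rows will differ from the full-dataset figure above.

```python
import numpy as np

carat = np.array([0.23, 0.21, 0.23, 0.29, 0.31])
price = np.array([326, 326, 327, 334, 335], dtype=float)

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson r: covariance divided by the product of standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)

r_manual = pearson(carat, price)
r_numpy = np.corrcoef(carat, price)[0, 1]
print(r_manual, r_numpy)
```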

Diamond Price Prediction

Now, I will move to the task of predicting diamond prices by using all the necessary information from the diamond price analysis done above.

Before moving forward, I will convert the cut column into numbers. The cut type of a diamond is a valuable feature for predicting its price, but to use it we need numerical values instead of categorical ones. Below is how we can convert it into a numerical feature (note that these codes are plain labels, not a quality ranking: Good maps to 3 and Very Good to 4):

In [9]:
data["cut"] = data["cut"].map({"Ideal": 1, 
                               "Premium": 2, 
                               "Good": 3,
                               "Very Good": 4,
                               "Fair": 5})

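If codes that follow the quality order are preferred, one option is an ordered pandas Categorical. The sketch below assumes the conventional grading Fair < Good < Very Good < Premium < Ideal, which is an assumption and not something stated in this article. It also produces different codes from the mapping above, so the prompt in the prediction cell would have to change with it.

```python
import pandas as pd

# Assumed quality order, lowest to highest (not taken from the article)
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

cuts = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair"])
ordered = pd.Categorical(cuts, categories=cut_order, ordered=True)

# .codes gives 0 for the lowest category up to 4 for the highest;
# any value missing from cut_order would become -1
cut_codes = pd.Series(ordered.codes, index=cuts.index)
print(cut_codes.tolist())
```
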
Now, let’s split the data into training and test sets:

In [10]:
#splitting data
from sklearn.model_selection import train_test_split
x = np.array(data[["carat", "cut", "size"]])
y = np.array(data[["price"]])

xtrain, xtest, ytrain, ytest = train_test_split(x, y, 
                                                test_size=0.10, 
                                                random_state=42)
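
Here test_size=0.10 holds out 10% of the rows for testing, and random_state=42 fixes the shuffle so the split is the same on every run. A small sketch on synthetic arrays (random numbers standing in for the carat, cut and size features) shows both effects:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with 3 features and a 1-D target
rng = np.random.default_rng(0)
x_demo = rng.random((100, 3))
y_demo = rng.random(100)

split_a = train_test_split(x_demo, y_demo, test_size=0.10, random_state=42)
split_b = train_test_split(x_demo, y_demo, test_size=0.10, random_state=42)

xtrain_demo, xtest_demo, ytrain_demo, ytest_demo = split_a
print(xtrain_demo.shape, xtest_demo.shape)

# The same random_state yields the same partition on every call
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```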

Now I will train a machine learning model for the task of diamond price prediction:

In [11]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
# ravel() flattens the (n_samples, 1) target into the 1-D shape scikit-learn expects
model.fit(xtrain, ytrain.ravel())
Out[11]:
RandomForestRegressor()
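
The held-out test set is not used anywhere in this article, so here is a sketch of one way to check a fitted regressor, using R² and mean absolute error from scikit-learn. The data is synthetic (made-up relations between carat, size and price), and n_estimators and random_state are chosen only to keep the demo fast and repeatable, so the numbers say nothing about the real diamonds model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, size=500)
cut = rng.integers(1, 6, size=500)               # codes 1..5, as in the mapping
size = carat * 160 + rng.normal(0, 5, 500)       # made-up carat-to-size relation
price = 5000 * carat + rng.normal(0, 300, 500)   # made-up price relation

x_demo = np.column_stack([carat, cut, size])
xtr, xte, ytr, yte = train_test_split(x_demo, price,
                                      test_size=0.10, random_state=42)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(xtr, ytr)

# Score the model on rows it has not seen during training
pred = demo_model.predict(xte)
r2 = r2_score(yte, pred)
mae = mean_absolute_error(yte, pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.1f}")
```

With the article's own variables, the equivalent calls would be model.predict(xtest) followed by r2_score(ytest, ...) and mean_absolute_error(ytest, ...).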

Below is how we can use our machine learning model to predict the price of a diamond:

In [12]:
print("Diamond Price Prediction")
a = float(input("Carat Size: "))
b = int(input("Cut Type (Ideal: 1, Premium: 2, Good: 3, Very Good: 4, Fair: 5): "))
c = float(input("Size: "))
features = np.array([[a, b, c]])
print("Predicted Diamond's Price = ", model.predict(features))
Diamond Price Prediction
Carat Size: 1
Cut Type (Ideal: 1, Premium: 2, Good: 3, Very Good: 4, Fair: 5): 2
Size: 5
Predicted Diamond's Price =  [3341.25333333]
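
The model was trained on size = x * y * z, so the "Size" input should be a value on that scale (the first rows of the dataset range from about 34 to 52). One way to avoid entering it by hand is to pass the three dimensions and compute the size inside a function. In the sketch below, predict_price is a hypothetical helper, and the stand-in model is fitted on just the five rows shown earlier so the snippet runs on its own; its predictions are not meaningful.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in model fitted on the five rows shown earlier (cut already mapped:
# Ideal=1, Premium=2, Good=3). Far too little data for real predictions.
x_small = np.array([
    [0.23, 1, 38.202030],
    [0.21, 2, 34.505856],
    [0.23, 3, 38.076885],
    [0.29, 2, 46.724580],
    [0.31, 3, 51.917250],
])
y_small = np.array([326, 326, 327, 334, 335], dtype=float)
stand_in = RandomForestRegressor(n_estimators=10, random_state=0)
stand_in.fit(x_small, y_small)

def predict_price(model, carat, cut_code, x, y, z):
    """Hypothetical helper: derive size from the dimensions, then predict."""
    size = x * y * z  # same formula as the "size" column
    features = np.array([[carat, cut_code, size]])
    return float(model.predict(features)[0])

estimate = predict_price(stand_in, carat=0.25, cut_code=2, x=4.0, y=4.0, z=2.5)
print(estimate)
```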

Summary

So this is how you can use your Data Science skills for diamond price analysis and prediction using the Python programming language. According to the diamond price analysis, premium cut diamonds tend to be larger and more expensive than other types of diamonds. I hope you liked this article on diamond price analysis and prediction using Python. Feel free to ask valuable questions in the comments section below.